Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved remote cluster performance #259

Merged
merged 7 commits into from
Jul 25, 2023

Conversation

jrbourbeau
Copy link
Collaborator

This PR adds an fsspec-compatible EarthAccessFile object that's a very light wrapper around the current https / S3 file objects being used. The only difference is that when a file is serialized (as happens when working on a remote cluster), if we find that we are running in region on AWS and the underlying file is an https file instead of an S3 file, we reopen the file using the existing eathaccess.open(...) logic to get a S3 file.

Anecdotally, I see a computation performance boost of 2x when running the notebook in #253 with the changes in this PR

Closes #253

@github-actions
Copy link

github-actions bot commented Jul 8, 2023

Binder 👈 Launch a binder notebook on this branch for commit 2df5caa

I will automatically update this comment whenever this PR is modified

Binder 👈 Launch a binder notebook on this branch for commit 92e580a

Binder 👈 Launch a binder notebook on this branch for commit 4f9a89c

Copy link
Contributor

@mrocklin mrocklin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't judge the earthaccess code well, but it's cool to see that this doesn't appear to be a crazy hack 🙂

earthaccess/store.py Outdated Show resolved Hide resolved
earthaccess/store.py Outdated Show resolved Hide resolved
@jrbourbeau jrbourbeau mentioned this pull request Jul 9, 2023
@betolink
Copy link
Member

This is amazing! @jrbourbeau I'll take a look today. On the tests failing, I have some WIP to only run integration tests on main. They are not deterministic and some datasets require EULAs, also when you run them from a fork secrets cannot be shared.

@jrbourbeau
Copy link
Collaborator Author

Okay, thanks @betolink. I figured it was something like that based of the failures over in #260. FWIW #261 should fix the documentation build

@MattF-NSIDC
Copy link

MattF-NSIDC commented Jul 25, 2023

This looks ready to merge to us; @betolink did some local testing and things worked great. We added a comment about an issue with data from the CMR API that we currently don't know how to resolve: The first "data link" is not always the "right" data link.

We're going to punt on the failing tests because Luis found the unit tests are passing, but issues with secrets are causing the integration tests to fail. Another PR for skipping integration tests on forks is coming soon :)

@jrbourbeau as this PR still has [WIP] in the title, are you planning to do more work on it? If not we can merge it tomorrow!

@mrocklin
Copy link
Contributor

@jrbourbeau is on vacation this week. Last I spoke with him I recall him saying positive things about this PR though, and saying that mostly it was waiting review. I wouldn't be surprised if he just forgot to remove the WIP.

@MattF-NSIDC MattF-NSIDC changed the title [WIP] Improved remote cluster performance Improved remote cluster performance Jul 25, 2023
@MattF-NSIDC
Copy link

Great! Thank you for your patience and contribution :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Better support with remote clusters for computation
5 participants